Skip to content

Conversation

@hasnain2808
Copy link
Contributor

@hasnain2808 hasnain2808 commented May 4, 2020

Add the nrows to read_json parameter that returns only the required number of json from the line delimited json

@hasnain2808 hasnain2808 changed the title Add nrows to read json Add nrows to read json. May 4, 2020
@hasnain2808 hasnain2808 force-pushed the add-nrows-to-read-json branch from 569db1c to fc4993f Compare May 4, 2020 07:13
@jreback jreback added IO JSON read_json, to_json, json_normalize Performance Memory or execution speed performance labels May 5, 2020
@hasnain2808
Copy link
Contributor Author

Hey @jreback
I have made all the change
As for typing args I tried for All of them but looks like I don't know much, so added as much as possible including nrows that were made necessary. Will try raising a new PR which will add typing hint for all args later.
I hope everything's in order
It is ready to be reviewed
Regards.

@hasnain2808 hasnain2808 requested a review from jreback May 7, 2020 13:15
@hasnain2808 hasnain2808 force-pushed the add-nrows-to-read-json branch 2 times, most recently from f72ea60 to 896de23 Compare May 9, 2020 21:49
@hasnain2808 hasnain2808 force-pushed the add-nrows-to-read-json branch from 896de23 to ca9c3e0 Compare May 10, 2020 01:59
@hasnain2808 hasnain2808 requested review from WillAyd and jreback May 10, 2020 02:43
Co-authored-by: William Ayd <william.ayd@icloud.com>
@hasnain2808 hasnain2808 force-pushed the add-nrows-to-read-json branch from 40a5c0a to a0a55c9 Compare May 19, 2020 17:28
@hasnain2808 hasnain2808 force-pushed the add-nrows-to-read-json branch from a0a55c9 to 74e9c2b Compare May 20, 2020 07:22
@hasnain2808 hasnain2808 requested a review from WillAyd May 20, 2020 08:29
@hasnain2808 hasnain2808 requested a review from WillAyd May 20, 2020 16:14
@hasnain2808 hasnain2808 force-pushed the add-nrows-to-read-json branch from d386e49 to ecdbc10 Compare May 22, 2020 19:07
@hasnain2808 hasnain2808 marked this pull request as ready for review June 2, 2020 18:35
@hasnain2808
Copy link
Contributor Author

hasnain2808 commented Jun 2, 2020

Hey @jreback

Hope you are having a great day

So this PR is basically for the addition of the feature of selecting the number of rows to be returned when the lines parameter is set.

According to the benchmarks I believe that feature is working fine.

So is it possible that we merge this feature and then create a new issue for the concerns that we are having with the chunk size feature? I have got a grip on this section of the codebase so I should be able to do that too but I believe we should have separate PR's for them.

Thanks and Regards

@jreback jreback added this to the 1.1 milestone Jun 2, 2020
@jreback
Copy link
Contributor

jreback commented Jun 2, 2020

@hasnain2808 yep that sounds fine. please add a whatsnew (about the nrows feature) and create an issue about chunksize & perf which can certainly be a followup.

ping on green (for the added whatsnew), put in other enhancements in 1.1

@hasnain2808
Copy link
Contributor Author

Hey @jreback
Added the whatsnew and it's green
I hope everything's in order to merge.
Thanks.

@WillAyd
Copy link
Member

WillAyd commented Jun 3, 2020

@hasnain2808 looks like another merge conflict - can you fix up?

So the benchmarks did show an improvement right? Not entirely clear from the link that you sent; usually at the end of the run you should get something saying PERFORMANCE IMPROVED . I might be overlooking that from what you shared

As long as that shows an improvement this lgtm

@hasnain2808
Copy link
Contributor Author

hasnain2808 commented Jun 4, 2020

@WillAyd

Hope you're having a nice day

I will fix the merge conflicts

Yes the memory consumption and time required is low when we use the nrows parameter.

As the nrows parameter is optional, we added a new benchmark for this parameter hence, the benchmark results do not show the PERFORMANCE IMPROVED verdict.

Pasted the results into the spreadsheet to compare easily
Link

Raw dumps are here
https://pastebin.com/E22PAxKe

Please do tell me if there is something else that's needed to be done

…d-nrows-to-read-json

"solve merge conflicts while merging to master"
@hasnain2808
Copy link
Contributor Author

hasnain2808 commented Jun 4, 2020

Fixed the merge conflicts.
I hope we merge this before some one else adds a what's new into master

@hasnain2808 hasnain2808 force-pushed the add-nrows-to-read-json branch from e9c54fe to 2ce74db Compare June 4, 2020 05:27
@hasnain2808
Copy link
Contributor Author

hasnain2808 commented Jun 4, 2020

I do not think this failure is related
It started coming once I merged master into my branch to resolve merge conflicts

The benchmark is failing around 0.64%

##[error][  0.65%] ··· arithmetic.ApplyIndex.time_apply_index                 3/10 failed
[  0.65%] ··· =================================== ==========
                             offset                         
              ----------------------------------- ----------
                      <YearEnd: month=12>          1.55±0ms 
                      <YearBegin: month=1>         1.37±0ms 
                 <QuarterEnd: startingMonth=3>     1.71±0ms 
                <QuarterBegin: startingMonth=3>    1.59±0ms 
                           <MonthEnd>              2.25±0ms 
                          <MonthBegin>             1.34±0ms 
                 <DateOffset: days=2, months=2>    3.16±0ms 
                         <BusinessDay>              failed  
                <SemiMonthEnd: day_of_month=15>     failed  
               <SemiMonthBegin: day_of_month=15>    failed  
              =================================== ==========

[  0.65%] ···· For parameters: <BusinessDay>
               Traceback (most recent call last):
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1184, in main_run_server
                   main_run(run_args)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1058, in main_run
                   result = benchmark.do_run()
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 537, in do_run
                   return self.run(*self._current_params)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 627, in run
                   samples, number = self.benchmark_timing(timer, min_repeat, max_repeat,
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 694, in benchmark_timing
                   timing = timer.timeit(number)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/timeit.py", line 177, in timeit
                   timing = self.inner(it, self.timer)
                 File "<timeit-src>", line 6, in inner
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 599, in <lambda>
                   func = lambda: self.func(*param)
                 File "/home/runner/work/pandas/pandas/asv_bench/benchmarks/arithmetic.py", line 469, in time_apply_index
                   offset.apply_index(self.rng)
                 File "pandas/_libs/tslibs/offsets.pyx", line 87, in pandas._libs.tslibs.offsets.apply_index_wraps.wrapper
                 File "pandas/_libs/tslibs/offsets.pyx", line 1397, in pandas._libs.tslibs.offsets.BusinessDay.apply_index
               AttributeError: 'PeriodIndex' object has no attribute '_addsub_int_array'
               
               For parameters: <SemiMonthEnd: day_of_month=15>
               Traceback (most recent call last):
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1184, in main_run_server
                   main_run(run_args)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1058, in main_run
                   result = benchmark.do_run()
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 537, in do_run
                   return self.run(*self._current_params)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 627, in run
                   samples, number = self.benchmark_timing(timer, min_repeat, max_repeat,
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 694, in benchmark_timing
                   timing = timer.timeit(number)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/timeit.py", line 177, in timeit
                   timing = self.inner(it, self.timer)
                 File "<timeit-src>", line 6, in inner
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 599, in <lambda>
                   func = lambda: self.func(*param)
                 File "/home/runner/work/pandas/pandas/asv_bench/benchmarks/arithmetic.py", line 469, in time_apply_index
                   offset.apply_index(self.rng)
                 File "pandas/_libs/tslibs/offsets.pyx", line 87, in pandas._libs.tslibs.offsets.apply_index_wraps.wrapper
                 File "pandas/_libs/tslibs/offsets.pyx", line 2319, in pandas._libs.tslibs.offsets.SemiMonthOffset.apply_index
               AttributeError: 'PeriodIndex' object has no attribute '_addsub_int_array'
               
               For parameters: <SemiMonthBegin: day_of_month=15>
               Traceback (most recent call last):
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1184, in main_run_server
                   main_run(run_args)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1058, in main_run
                   result = benchmark.do_run()
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 537, in do_run
                   return self.run(*self._current_params)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 627, in run
                   samples, number = self.benchmark_timing(timer, min_repeat, max_repeat,
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 694, in benchmark_timing
                   timing = timer.timeit(number)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/timeit.py", line 177, in timeit
                   timing = self.inner(it, self.timer)
                 File "<timeit-src>", line 6, in inner
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 599, in <lambda>
                   func = lambda: self.func(*param)
                 File "/home/runner/work/pandas/pandas/asv_bench/benchmarks/arithmetic.py", line 469, in time_apply_index
                   offset.apply_index(self.rng)
                 File "pandas/_libs/tslibs/offsets.pyx", line 87, in pandas._libs.tslibs.offsets.apply_index_wraps.wrapper
                 File "pandas/_libs/tslibs/offsets.pyx", line 2319, in pandas._libs.tslibs.offsets.SemiMonthOffset.apply_index
               AttributeError: 'PeriodIndex' object has no attribute '_addsub_int_array'

@hasnain2808 hasnain2808 force-pushed the add-nrows-to-read-json branch from 18bf6e7 to 133aef9 Compare June 4, 2020 14:34
@hasnain2808
Copy link
Contributor Author

ping

@hasnain2808
Copy link
Contributor Author

Ping
@jreback it's all green.

@jreback jreback merged commit 89c5a59 into pandas-dev:master Jun 4, 2020
@jreback
Copy link
Contributor

jreback commented Jun 4, 2020

thanks @hasnain2808

@hasnain2808
Copy link
Contributor Author

Thanks @jreback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

IO JSON read_json, to_json, json_normalize Performance Memory or execution speed performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENH: Add nrows parameter to pd.read_json

4 participants